Ask The Experts


Expert Advice


We are constantly working to improve our products, and we keep your comments and questions in mind. Please write to us at julian@scriptsoftware.com. We will respond as quickly as possible.


KnowledgeMiner Discussion Forum


You can contribute to our discussion forum by posting comments, results, or answers. You may also find answers to your questions there.


Easy Learning


Please check out the Easy Learning page to get a quick leg up on using KnowledgeMiner. This page was created by our user Bert Altenburg. Thanks, Bert.


FAQs


Here are some Frequently Asked Questions from our customers.

Q: What does GMDH stand for?

A: GMDH stands for Group Method of Data Handling. It is a statistical learning network technology based on the cybernetic principle of self-organization, drawing on systems theory, information theory, control theory, and computer science. GMDH is not a traditional statistical modeling method; it is an interdisciplinary approach designed to overcome some of the main disadvantages of statistics and neural networks (NNs). Below is a description of GMDH from the preface to Farlow's book.

"In statistics nowadays there is a distinguishable trend away from the restrictive assumptions of parametric analysis and toward the more computer-oriented area of what is generally known as nonparametric data analysis. One of the more fascinating concepts from this new generation of research is what is known as the GMDH algorithm, which was introduced and is currently being developed by the Ukrainian cyberneticist and engineer A.G Ivakhnenko. What is known these days as a heuristic, the GMDH algorithm constructs high-order regression-type models for complex systems and has the advantage over traditional modeling in that the modeler can more-or-less throw into the algorithm all sorts of input/ output types of observations, and the computer does the rest. The computer self-organizes the model from a simple one to one of optimal complexity by a methodology not unlike the process of natural evolution. It is the purpose of this book to introduce to English-speaking people the basic GMDH algorithm, present variations and examples of its use and list a bibliography of all published work in this growing area of research."

S. J. Farlow (ed.), Self-Organizing Methods in Modeling: GMDH Type Algorithms (1984)

You can find a short intro in Paper 1 (section Self-organizing modeling technologies) on our web site. You may also want to look at the publications area for more information.
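
For readers who want to see the basic idea in code, here is a minimal sketch of one multilayer GMDH pass in Python. It is our own illustration for this FAQ, not KnowledgeMiner's implementation: candidate models are Ivakhnenko's two-input quadratic polynomials, each layer keeps the candidates that perform best on separate check data (the external criterion), and layering stops as soon as the best check error no longer improves.

import itertools
import numpy as np

def _design(x1, x2):
    # Ivakhnenko's two-input quadratic: 1, x1, x2, x1*x2, x1^2, x2^2
    return np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2])

def gmdh_check_error(X_learn, y_learn, X_check, y_check, keep=6, max_layers=10):
    # Grow layers of pairwise quadratic models; survivors are selected by
    # mean squared error on the check set, and growth stops when that error
    # stops improving (the model would otherwise start to overfit).
    best_err = np.inf
    for _ in range(max_layers):
        candidates = []
        for i, j in itertools.combinations(range(X_learn.shape[1]), 2):
            coef, *_ = np.linalg.lstsq(
                _design(X_learn[:, i], X_learn[:, j]), y_learn, rcond=None)
            pred = _design(X_check[:, i], X_check[:, j]) @ coef
            candidates.append((np.mean((pred - y_check) ** 2), i, j, coef))
        candidates.sort(key=lambda c: c[0])
        if candidates[0][0] >= best_err:   # no improvement: optimal complexity
            break
        best_err = candidates[0][0]
        survivors = candidates[:keep]
        # outputs of surviving models become the next layer's inputs
        X_learn = np.column_stack(
            [_design(X_learn[:, i], X_learn[:, j]) @ c for _, i, j, c in survivors])
        X_check = np.column_stack(
            [_design(X_check[:, i], X_check[:, j]) @ c for _, i, j, c in survivors])
    return best_err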


Q: Would you consider your products suitable for financial and product-demand forecasting using numerous variable inputs?

A: Yes, this is one of the primary application fields for KnowledgeMiner. In contrast to statistics or NNs, you can use more variables than there are samples available for modeling. For example, you can create a prediction model (e.g., a linear system of equations) of 40 variables when only 30 observations of each variable are available. You can consider up to 500 input variables (lagged and unlagged) in KnowledgeMiner to model complex time processes. Additionally, KnowledgeMiner implements Analog Complexing, an extremely powerful prediction technique for fuzzy processes like financial markets. Used on financial markets, KnowledgeMiner could really strike gold!
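
To make "lagged and unlagged inputs" concrete, here is a small helper, our own illustration rather than a KM function, that turns a multivariate time series into the kind of design matrix described above. With 40 variables and a few lags each, the columns quickly outnumber 30 rows of observations:

import numpy as np

def make_lagged_inputs(data, max_lag):
    # data: (n_samples, n_vars) time series. Returns a matrix whose columns
    # are x_j(t), x_j(t-1), ..., x_j(t-max_lag) for every variable j;
    # rows are aligned so any lag-0 column can serve as the model output.
    n, m = data.shape
    blocks, names = [], []
    for lag in range(max_lag + 1):
        blocks.append(data[max_lag - lag : n - lag, :])
        names += [f"x{j}(t)" if lag == 0 else f"x{j}(t-{lag})" for j in range(m)]
    return np.hstack(blocks), names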


Q: It "feels" like KnowledgeMiner might assist in detecting relationships among certain patient groups by clinical criteria vs. fluid measurements that may be missed by an individual. If I understand the application of KnowledgeMiner, I believe I should be able to take our database of eye features along with the diopter measurements for each patient in the database, plug those into KnowledgeMiner and then KnowledgeMiner will derive an equation for calculating the diopter measurement of a patient as a function of the patient's image features. Is this true?

A: Yes, exactly. This is something KnowledgeMiner can do.


Q: First, I'd like to compliment you on your choice of platform ;-). With Motorola's new math libraries and the higher clock speeds of their chips, math-intensive applications such as KnowledgeMiner (KM) are best run on a Mac; Byte's recent benchmarks show the new Macs running twice as fast in SpecInt and 50% faster in SpecFPU than comparable P5 or P6 chips running at the same speed.
I'm a defense contractor in the U.S. and I also work as a consultant doing image processing programming and object classification work. I downloaded the KM demo last night and was very impressed with what I think I saw! Congratulations on a very impressive algorithm and its implementation in a GUI that everyone can use. One of the difficulties with statistical pattern recognition in my application is that one might not get a sufficiently sophisticated classifier to give the best possible results (for example, using a linear classifier instead of a more complex quadratic classifier). It appears that KM does not suffer from this problem, because it appears to produce a bona fide nonlinear equation, which should optimally accommodate any irregular shaping of the class populations in feature space. Is this true?

A: Yes, you are correct. One important feature of KnowledgeMiner is that it creates models in an evolutionary way: from very simple models to increasingly complex ones. It stops automatically when an optimally complex model is found, that is, when the model begins to overfit the design data (the data used to create relationships between variables).


Q: The possibility of time-lag models is really interesting too. In human training studies, the number of measurements per year is very low (2-6) compared to the number of test variables (10-20). How many subjects are necessary?

A: The same holds when you want to create a dynamic model. In contrast to statistics or neural networks, KnowledgeMiner can deal with a very small number of cases (6 or more). In fact, the number of cases used for modeling can be smaller than the number of variables (so-called under-determined tasks). So it is really possible to use 10 variables and only 6-10 samples to create a linear system of equations.
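
As a toy illustration of such an under-determined task (our own example with made-up numbers, not KM code): NumPy's least-squares solver happily returns a coefficient vector even when the unknowns outnumber the equations, picking the minimum-norm solution among the infinitely many exact fits.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 10))       # 8 samples, 10 variables: under-determined
true_w = np.zeros(10)
true_w[[0, 3]] = [2.0, -1.5]       # only two variables actually matter
y = X @ true_w

# lstsq returns the minimum-norm coefficients satisfying all 8 equations;
# GMDH goes further by also selecting which variables to keep at all.
w, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(rank)                        # 8, fewer than the 10 unknowns
print(np.allclose(X @ w, y))       # True: the equations are fitted exactly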


Q: What would be the largest table (columns and rows) KnowledgeMiner could accommodate if allocated 100 MB of RAM?

A: The table contains approximated values as an orientation:

Rows     Max. inputs
 100         500
 200         350
 300         280
 400         240
 500         210
1000         150
2000         110
5000          70

KnowledgeMiner optimizes several modeling tasks, so it is not possible to give exact values in advance. The real memory requirements may actually be smaller.


Q: We've purchased NGO, and our company is Windows-based. I'll be doing KM at home on my Performa 6400/180. If I get the full version, what kind of performance can I expect on the Performa?

A: There are two aspects: speed and RAM. 180 MHz is good even for large problems. For small modeling problems (< 50 inputs and < 100 samples), it will take a few minutes and roughly 100 KB-2 MB of RAM temporarily to create a GMDH model (once you are familiar with the program). However, RAM requirements in particular grow rapidly (to 10-100 MB and more) for larger modeling problems (> 100 inputs and > 500 samples). It can then take up to an hour or two to get a model; alternative methods would take days or weeks at this level of problem complexity.


Q: How many records of data can I put into KM if I have 11 inputs and 6 outputs? Will I need a different data sheet for each output even if the input values are the same? Is copying and pasting the easiest way, or should I save subsequent output entries with different names?

A: No, KM is "un-PC" here too! It can handle up to 500 inputs (including lagged variables for dynamic modeling) and a virtually unlimited number of outputs (read: models) in a single document, using the same physical data sheet without copying or pasting any data. All models are stored in a model base, and for each column of the sheet, four different model types can be created and stored simultaneously: a time series model (autoregressive), an input-output model (static or dynamic), a system model (multi-input/multi-output), and an Analog Complexing model.

KM 3.0 implements a third modeling method: self-organizing fuzzy-rule induction, or Fuzzy-GMDH. So a fifth model can be added to the model base for each column. KM 3 will also extend the spreadsheet from the current 1,000 rows to 10,000 rows.
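
As a rough picture of how such a model base could be organized, here is a hypothetical sketch in Python; the class and slot names are our invention, not KnowledgeMiner's internals:

from dataclasses import dataclass, field

MODEL_TYPES = (
    "time_series",        # auto-regressive model
    "input_output",       # static or dynamic input-output model
    "system",             # multi-input/multi-output system model
    "analog_complexing",  # pattern-based Analog Complexing model
    "fuzzy_gmdh",         # fuzzy-rule model, new in KM 3.0
)

@dataclass
class ColumnModels:
    # one slot per model type; every sheet column owns one of these
    slots: dict = field(default_factory=lambda: {t: None for t in MODEL_TYPES})

    def store(self, model_type, model):
        if model_type not in self.slots:
            raise ValueError(f"unknown model type: {model_type}")
        self.slots[model_type] = model

# the model base: one entry per spreadsheet column, no data copied anywhere
model_base = {name: ColumnModels() for name in ("Sales", "Price", "Demand")}
model_base["Sales"].store("time_series", "y(t) = 0.8*y(t-1) + 12.3")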


Q: I've recently downloaded KM and I am wondering how it compares to NGO for Windows. I'm going to compare models and "closeness" of fit between the two, but I'm concerned that the demo version will cut me off at 4 levels and not fit as well as it might have with the full potential of the full edition. What are the advantages of KM over NGO, or over even more expensive software packages such as GenSym?

A: This has been described a little elsewhere in this FAQ. An important advantage is also that KM always produces a model description usable for interpretation and analysis. You can see why results are as they are and which variables KM has selected as relevant. For fuzzy-rule induction, for example, you get models in an almost natural language, as this model from the wine recognition example shows:

IF N_Flavanoids & NOT_N_Nonflavanoid phenols & NOT_N_Color intensity
   OR NOT_N_Ash & N_OD280/OD315 of diluted wines
   OR NOT_N_Color intensity & NOT_P_Magnesium & N_Flavanoids
   & NOT_N_Alcalinity of ash & NOT_P_Hue
THEN wine_cultivar #3
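
Reading each prefixed name as a membership degree between 0 and 1, such a rule can be evaluated mechanically. Here is a minimal sketch assuming the common min/max/complement fuzzy operators; whether KM uses exactly these operators is our assumption, not something this FAQ states:

def NOT(x): return 1.0 - x
def AND(*xs): return min(xs)
def OR(*xs): return max(xs)

def wine_cultivar_3(m):
    # m: dict mapping the fuzzy-set names used in the rule to degrees in [0, 1]
    return OR(
        AND(m["N_Flavanoids"], NOT(m["N_Nonflavanoid phenols"]),
            NOT(m["N_Color intensity"])),
        AND(NOT(m["N_Ash"]), m["N_OD280/OD315 of diluted wines"]),
        AND(NOT(m["N_Color intensity"]), NOT(m["P_Magnesium"]),
            m["N_Flavanoids"], NOT(m["N_Alcalinity of ash"]), NOT(m["P_Hue"])),
    )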

The main difference, however, is that KM, in addition to the black-box approach and the connectionism of NNs, is based on a third principle called inductive self-organization.

Inductive self-organizing modeling theory and practice have proven that "closeness of fit" cannot be the only criterion for finding a "best" model. It is necessary, during modeling, to validate each model candidate's performance on some new data. If this step is missing (as is common in neuro-fuzzy-genetic approaches), models will inherently tend to be overfitted. This is because it is always possible (at least theoretically) to formulate a model that fits any given (finite) learning data set with almost 100% accuracy, driven by the rule "the more complicated the model is, the more accurately it will fit the given data." This is true even for completely random samples. For noisy data, it means that, at a certain point in modeling, the model begins to fit the noise (overfitting), which results in bad or catastrophic performance on new data. The model fits the design data better, but at the same time it loses accuracy when applied to previously unseen data; it is too complex. So the problem is to find the point where a model begins to reflect random relations. This is what we call creating an optimally complex model. GMDH can do this.
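
The effect is easy to reproduce. This small demonstration (our own illustration, not KM code) fits polynomials of growing degree to noisy design data: the design error falls monotonically, while the error on fresh check data falls and then turns up again; the optimally complex model sits at the minimum of the check error.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y_true = np.sin(2 * np.pi * x)
y_design = y_true + rng.normal(0.0, 0.2, x.size)   # noisy design data
y_check = y_true + rng.normal(0.0, 0.2, x.size)    # fresh, unseen noise

for degree in range(1, 10):
    coef = np.polyfit(x, y_design, degree)
    fit = np.polyval(coef, x)
    design_mse = np.mean((fit - y_design) ** 2)    # keeps falling
    check_mse = np.mean((fit - y_check) ** 2)      # falls, then rises
    print(f"degree {degree}: design {design_mse:.4f}  check {check_mse:.4f}")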






Contact:
knowledgeminer@iworld.to     julian@scriptsoftware.com 

Date Last Modified: 03/23/99